Business Running Case: Evaluating Personal Job Market Prospects in 2024


📌 Introduction

In this project, we analyzed the “lightcast_job_postings.csv” dataset, which contains detailed job market information, including job titles, companies, locations, salaries, and various metadata. The dataset initially comprised 131 columns, offering a comprehensive view of job postings and associated attributes for evaluating personal job market prospects in 2024.

import pandas as pd
import plotly.express as px
import missingno as msno
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import us 
# low_memory=False reads the file in one pass so mixed-type columns get a single inferred dtype
df = pd.read_csv("lightcast_job_postings.csv", low_memory=False)
df.head()
ID LAST_UPDATED_DATE LAST_UPDATED_TIMESTAMP DUPLICATES POSTED EXPIRED DURATION SOURCE_TYPES SOURCES URL ... NAICS_2022_2 NAICS_2022_2_NAME NAICS_2022_3 NAICS_2022_3_NAME NAICS_2022_4 NAICS_2022_4_NAME NAICS_2022_5 NAICS_2022_5_NAME NAICS_2022_6 NAICS_2022_6_NAME
0 1f57d95acf4dc67ed2819eb12f049f6a5c11782c 9/6/24 2024-09-06 20:32:57.352 Z 0 6/2/24 6/8/24 6.0 [\n "Company"\n] [\n "brassring.com"\n] [\n "https://sjobs.brassring.com/TGnewUI/Sear... ... 44 Retail Trade 441.0 Motor Vehicle and Parts Dealers 4413.0 Automotive Parts, Accessories, and Tire Retailers 44133.0 Automotive Parts and Accessories Retailers 441330.0 Automotive Parts and Accessories Retailers
1 0cb072af26757b6c4ea9464472a50a443af681ac 8/2/24 2024-08-02 17:08:58.838 Z 0 6/2/24 8/1/24 NaN [\n "Job Board"\n] [\n "maine.gov"\n] [\n "https://joblink.maine.gov/jobs/1085740"\n] ... 56 Administrative and Support and Waste Managemen... 561.0 Administrative and Support Services 5613.0 Employment Services 56132.0 Temporary Help Services 561320.0 Temporary Help Services
2 85318b12b3331fa490d32ad014379df01855c557 9/6/24 2024-09-06 20:32:57.352 Z 1 6/2/24 7/7/24 35.0 [\n "Job Board"\n] [\n "dejobs.org"\n] [\n "https://dejobs.org/dallas-tx/data-analys... ... 52 Finance and Insurance 524.0 Insurance Carriers and Related Activities 5242.0 Agencies, Brokerages, and Other Insurance Rela... 52429.0 Other Insurance Related Activities 524291.0 Claims Adjusting
3 1b5c3941e54a1889ef4f8ae55b401a550708a310 9/6/24 2024-09-06 20:32:57.352 Z 1 6/2/24 7/20/24 48.0 [\n "Job Board"\n] [\n "disabledperson.com",\n "dejobs.org"\n] [\n "https://www.disabledperson.com/jobs/5948... ... 52 Finance and Insurance 522.0 Credit Intermediation and Related Activities 5221.0 Depository Credit Intermediation 52211.0 Commercial Banking 522110.0 Commercial Banking
4 cb5ca25f02bdf25c13edfede7931508bfd9e858f 6/19/24 2024-06-19 07:00:00.000 Z 0 6/2/24 6/17/24 15.0 [\n "FreeJobBoard"\n] [\n "craigslist.org"\n] [\n "https://modesto.craigslist.org/sls/77475... ... 99 Unclassified Industry 999.0 Unclassified Industry 9999.0 Unclassified Industry 99999.0 Unclassified Industry 999999.0 Unclassified Industry

5 rows × 131 columns

print(df.shape)
(73101, 131)

📌 Data Cleaning & Preprocessing

columns_to_drop = [
    "ID", "URL", "ACTIVE_URLS", "DUPLICATES", "LAST_UPDATED_TIMESTAMP",
    "NAICS2", "NAICS3", "NAICS4", "NAICS5", "NAICS6",
    "SOC_2", "SOC_3", "SOC_5"
]
df.drop(columns=columns_to_drop, inplace=True)

To prepare the dataset for analysis, we undertook a thorough data cleaning and preprocessing process, including:

  1. Dropping Irrelevant Columns:
  • We removed columns that were either redundant or not relevant to our analysis. These included unique identifiers (ID), URLs of job postings (URL, ACTIVE_URLS), and columns providing less granular versions of NAICS and SOC codes (NAICS2, SOC_3, etc.).
  • The rationale for dropping these columns was to enhance data efficiency and clarity, reduce the dataset size, and focus on the most granular and meaningful data.
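
The column-dropping step above can be sketched on a small stand-in frame. This is a minimal illustration, not the project's exact code: the helper name `drop_existing_columns` and the toy values are hypothetical, and `errors="ignore"` is an added safeguard so the call does not fail when a listed column is absent.

```python
import pandas as pd


def drop_existing_columns(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of `frame` without `columns`, skipping names that are absent."""
    # errors="ignore" keeps pandas from raising KeyError for missing column names
    return frame.drop(columns=columns, errors="ignore")


# Toy frame standing in for the job postings data (hypothetical values)
toy = pd.DataFrame(
    {
        "ID": ["a", "b"],
        "URL": ["u1", "u2"],
        "TITLE": ["Data Analyst", "BI Analyst"],
    }
)

# "ACTIVE_URLS" is not in the toy frame, so it is silently skipped
slim = drop_existing_columns(toy, ["ID", "URL", "ACTIVE_URLS"])
print(list(slim.columns))
```

Returning a new frame instead of using `inplace=True` leaves the original available for comparing column counts before and after.
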
# Visualize missing data
msno.heatmap(df)

# Drop columns with >50% missing values
df.dropna(thresh=len(df) * 0.5, axis=1, inplace=True)

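# Fill missing values
# NAICS codes are categorical identifiers, so the median is only a numeric placeholder
# and may not be a valid industry code; the matching name column is labeled "Unknown".
df["NAICS_2022_6"] = df["NAICS_2022_6"].fillna(df["NAICS_2022_6"].median())
df["NAICS_2022_6_NAME"] = df["NAICS_2022_6_NAME"].fillna("Unknown")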

Missing Data Visualization
  2. Handling Missing Values:
  • The heatmap above shows that the columns ACTIVE_SOURCES_INFO, MODELED_DURATION, and MODELED_EXPIRED contain significant missing values, suggesting potential data collection or extraction issues. The heatmap reports nullity correlation, that is, how the missingness of one column relates to the missingness of another. Notably, DURATION and EXPIRED exhibit a moderate positive correlation (0.5), indicating that when one of them is missing, the other tends to be missing as well.
  • Additionally, ACTIVE_SOURCES_INFO shows a strong negative correlation with other variables, implying that it tends to be missing in rows where those columns are present, and vice versa.
  • In contrast, columns like MSA, MSA_NAME, and LIGHTCAST_SECTORS have no missing values, providing a reliable foundation for further analysis. Addressing missing data in correlated columns through targeted imputation can reduce bias and improve the accuracy of our analysis.
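
The heatmap reading can be cross-checked numerically. The sketch below is an assumption-laden illustration on a toy frame with hypothetical values: it computes the share of missing values per column and the correlation between missingness indicators, which is the kind of quantity a nullity heatmap displays. Columns that are never or always missing are excluded because their indicator is constant.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the job postings data (hypothetical values)
toy = pd.DataFrame(
    {
        "DURATION": [6.0, np.nan, 35.0, np.nan],
        "EXPIRED": ["6/8/24", None, "7/7/24", "7/20/24"],
        "MSA_NAME": ["A", "B", "C", "D"],
    }
)

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# True where a value is missing
nullity = toy.isna()

# Keep only columns that are sometimes, but not always, missing
partial = nullity.loc[:, nullity.any() & ~nullity.all()]

# Pearson correlation between the 0/1 missingness indicators
nullity_corr = partial.astype(int).corr()

print(missing_share)
print(nullity_corr)
```

A positive entry means two columns tend to be missing in the same rows; a negative entry means one tends to be present where the other is missing.
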
df = df.drop_duplicates(subset=["TITLE", "COMPANY", "LOCATION", "POSTED"], keep="first")
  3. Removing Duplicates:
  • To maintain unique job postings, duplicates were removed based on job title, company, location, and posting date.
  • This ensured that each job opportunity was only represented once, preventing skewed analysis results.
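
Before dropping duplicates, it can help to count how many rows the step will remove. The following is a minimal sketch on a toy frame with hypothetical values; it uses the same four key columns as the deduplication call above.

```python
import pandas as pd

# Columns that together identify a unique posting
key = ["TITLE", "COMPANY", "LOCATION", "POSTED"]

# Toy frame standing in for the job postings data (hypothetical values)
toy = pd.DataFrame(
    {
        "TITLE": ["Data Analyst", "Data Analyst", "Data Analyst"],
        "COMPANY": [1, 1, 2],
        "LOCATION": ["Dallas, TX", "Dallas, TX", "Dallas, TX"],
        "POSTED": ["6/2/24", "6/2/24", "6/2/24"],
        "SOURCES": ["dejobs.org", "disabledperson.com", "maine.gov"],
    }
)

# Rows flagged True repeat an earlier row on the key columns
n_dupes = int(toy.duplicated(subset=key, keep="first").sum())

# Keep the first occurrence of each key combination
deduped = toy.drop_duplicates(subset=key, keep="first")

print(f"{n_dupes} duplicate row(s) removed, {len(deduped)} remaining")
```

Rows that differ only in non-key columns, such as the source site here, are treated as the same posting.
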

Impact of Data Cleaning

The data cleaning process resulted in a more streamlined and manageable dataset, eliminating redundancy and potential inconsistencies. This step set the foundation for accurate and insightful analysis in the subsequent phases of the project.


📌 Exploratory Data Analysis (EDA)

Top 15 Job Posting Industries

top_industries = df["NAICS_2022_6_NAME"].value_counts().nlargest(15)
fig = px.pie(
    names=top_industries.index, 
    values=top_industries.values, 
    title="Top 15 Job Posting Industries"
)
fig

Top 15 Job Posting Industries

In our analysis, we looked at the industries with the most job postings to get a sense of where demand is highest. Because the pie chart covers only the 15 largest industries, its percentages are shares of those charted postings rather than of the full dataset. Within that group, 22.6% of the postings fall under the Unclassified Industry category, suggesting that many roles either span multiple sectors or lack clear classification. This points to potential limitations in how industries are labeled in the data.

We also noticed strong demand for skills in technology and consulting. For example, Custom Computer Programming Services made up 12.1% of the charted postings, while Management Consulting Services accounted for 11.3%. This highlights a significant need for both tech and management skills in today’s job market.

Interestingly, several tech-focused industries, such as Software Publishers and Computer Systems Design Services, showed up prominently in the chart. This aligns with the growing demand for IT and consulting professionals, which didn’t surprise us given the ongoing digital transformation across industries.

We also found that finance and healthcare sectors have a notable share of job postings, indicating steady demand in these fields. On the flip side, areas like Accounting and Temporary Help Services had fewer listings, suggesting these might be more niche markets.
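
A pie chart built from the top categories reports each slice as a share of the charted total, not of all postings. The sketch below illustrates the difference on a toy series; the counts are hypothetical and the top 2 categories stand in for the top 15.

```python
import pandas as pd

# Toy industry labels standing in for NAICS_2022_6_NAME (hypothetical counts)
industries = pd.Series(
    ["Unclassified Industry"] * 5
    + ["Custom Computer Programming Services"] * 3
    + ["Commercial Banking"] * 2
)

counts = industries.value_counts()

# Top 2 here plays the role of nlargest(15) in the analysis
top = counts.nlargest(2)

# Share of the charted slices only (what the pie chart displays)
share_of_chart = top / top.sum() * 100

# Share of every posting in the dataset
share_of_all = top / counts.sum() * 100

print(share_of_chart.round(1))
print(share_of_all.round(1))
```

The two measures diverge more as the categories outside the chart account for a larger part of the dataset.
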


Remote vs. On-Site Jobs (Data Roles)

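# df_roles is the subset of df covering data-related roles; its construction is not shown in this section
fig = px.pie(df_roles, names="REMOTE_TYPE_NAME", title="Remote vs. On-Site Jobs")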
fig.show()

Remote vs. On-Site Jobs

We created a pie chart to explore the distribution of remote, on-site, and hybrid roles for data-related jobs. The chart shows that 73.8% of postings do not specify a work arrangement, suggesting either data gaps or employer flexibility. Of all postings, 20.4% are remote, 4% are hybrid, and only 1.79% are fully on-site, so remote roles make up roughly 78% of the postings that do state an arrangement.

These findings suggest a shift towards remote and flexible work arrangements in the data field, although the large unspecified share leaves the true split uncertain. Highlighting remote work skills could be advantageous for job seekers in this area.
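
The shares among postings that state a work arrangement can be derived from the chart percentages. This is a small arithmetic sketch: the percentages are the ones reported above, while the category labels are placeholders and may not match the actual REMOTE_TYPE_NAME values.

```python
# Percentages of all postings, as reported from the pie chart
# (labels are placeholders, not necessarily the dataset's own values)
shares = {
    "Not specified": 73.8,
    "Remote": 20.4,
    "Hybrid": 4.0,
    "On-site": 1.79,
}

# Keep only postings that state an arrangement
specified = {name: pct for name, pct in shares.items() if name != "Not specified"}
total_specified = sum(specified.values())

# Re-express each arrangement as a share of the specified postings
share_among_specified = {
    name: round(pct / total_specified * 100, 1) for name, pct in specified.items()
}

print(f"Postings stating an arrangement: {total_specified:.2f}%")
print(share_among_specified)
```

Reading the figures this way separates two questions: how often employers state an arrangement at all, and which arrangement they choose when they do.
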


📌 Key Findings

  1. Industry Demand:
  • Programming Services, Consulting Services, and Insurance emerged as leading industries for data-related roles, indicating a widespread need for data skills beyond traditional tech sectors.
  • The prominence of Unclassified Industry suggests potential gaps in data classification or a diverse range of roles that do not fit into conventional categories.
  2. Geographical Distribution:
  • California, Texas, and Florida were identified as major hubs for data-related jobs, while several Midwestern and Mountain states showed fewer opportunities.
  • This pattern suggests that professionals may find more job prospects by focusing on these high-demand states.
  3. Work Arrangement Preferences:
  • About a quarter of job postings specified remote or hybrid arrangements, with over 20% specifically offering remote options.
  • The limited proportion of fully on-site roles reflects a broader shift towards flexible work models in the data industry.
  4. Role-Specific Trends:
  • There is a clear upward trend in demand for roles such as Big Data Analysts and Business Intelligence Analysts, highlighting the growing importance of both technical and analytical skills.
  • Specialized roles like Clinical Data Analysts and Customer Data Analysts are also gaining traction, indicating expanding opportunities in niche areas.

📌 Conclusion

Our analysis revealed that the demand for data-related skills is both substantial and diverse, spanning multiple industries and regions across the United States. Key sectors such as tech, consulting, and insurance show the most significant opportunities, while states like California, Texas, and Florida lead in job postings.

The strong preference for remote and hybrid roles highlights the importance of flexibility in the current job market. Meanwhile, the consistent rise in demand for specialized data roles suggests a promising outlook for professionals equipped with both analytical and domain-specific skills.

Overall, these findings suggest that focusing on high-demand industries, enhancing remote work capabilities, and acquiring specialized skills can significantly boost job prospects in the data field.